This document aims to provide further examples in how to use the hpgltools.
Note to self, the header has rmarkdown::pdf_document instead of html_document or html_vignette because it gets some bullcrap error ‘margins too large’…
Here are the commands I invoke to get ready to play with new data, including everything required to install hpgltools, the software it uses, and the fission data.
if (! "BiocInstaller" %in% installed.packages()) {
source("http://bioconductor.org/biocLite.R")
}
if (! "devtools" %in% installed.packages()) {
biocLite("devtools")
}
if (! "hpgltools" %in% installed.packages()) {
devtools::install_github("elsayed-lab/hpgltools")
}
if (! "fission" %in% installed.packages()) {
biocLite("fission")
}
## I use the function 'sm()' to quiet loud functions.
tt <- sm(library(fission))
tt <- sm(data(fission))
All the work I do in Dr. El-Sayed’s lab makes some pretty hard assumptions about how data is stored. As a result, to use the fission data set I will do a little bit of shenanigans to match it to the expected format. Now that I have played a little with fission, I think its format is quite nice and am likely to have my experiment class instead be a SummarizedExperiment.
## Extract the meta data from the fission dataset
meta <- as.data.frame(fission@colData)
## Make conditions and batches
meta$condition <- paste(meta$strain, meta$minute, sep=".")
meta$batch <- meta$replicate
meta$sample.id <- rownames(meta)
## Grab the count data
fission_data <- fission@assays$data$counts
## This will make an experiment superclass called 'expt' and it contains
## an ExpressionSet along with any arbitrary additional information one might want to include.
## Along the way it writes a Rdata file which is by default called 'expt.Rdata'
fission_expt <- create_expt(metadata=meta, count_dataframe=fission_data)
## Reading the sample metadata.
## The sample definitions comprises: 36, 7 rows, columns.
## Matched 7039 annotations and counts.
## Bringing together the count matrix and gene information.
There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
fis_libsize <- plot_libsize(fission_expt)
fis_libsize$plot
## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
fis_nonzero <- plot_nonzero(fission_expt, labels="boring", title="nonzero vs. cpm")
fis_nonzero$plot
fis_density <- plot_density(fission_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fis_density$plot
fis_density$condition_summary
## condition min 1st median mean 3rd max
## 1: wt.0 0.5 31 216 1839 647 5346309
## 2: wt.15 0.5 20 135 1819 474 6733064
## 3: wt.30 0.5 27 183 1917 559 6722419
## 4: wt.60 0.5 27 195 1676 627 5034452
## 5: wt.120 0.5 25 183 1540 529 4531663
## 6: wt.180 0.5 29 210 1904 600 7223993
## 7: mut.0 0.5 35 238 1826 716 4818286
## 8: mut.15 0.5 22 142 1582 461 4387929
## 9: mut.30 0.5 33 226 2028 684 6061779
## 10: mut.60 0.5 39 279 1981 772 5268796
## 11: mut.120 0.5 35 242 1763 675 4174104
## 12: mut.180 0.5 24 178 1534 510 4276682
fis_density$batch_summary
## batch min 1st median mean 3rd max
## 1: r1 0.5 30 205 1908 632 6733064
## 2: r2 0.5 28 193 1727 586 7223993
## 3: r3 0.5 27 196 1717 600 6722419
fis_density$sample_summary
## sample min 1st median mean 3rd max
## 1: GSM1368273 0.5 46.0 306 2225.5 869.0 4542884
## 2: GSM1368274 0.5 21.0 147 1344.9 411.0 4117160
## 3: GSM1368275 0.5 34.0 243 1946.5 678.0 5346309
## 4: GSM1368276 0.5 34.0 226 2625.1 751.5 6733064
## 5: GSM1368277 0.5 20.0 123 1470.8 393.0 4094540
## 6: GSM1368278 0.5 13.0 95 1360.3 312.0 4714248
## 7: GSM1368279 0.5 38.0 257 2321.0 754.5 5624485
## 8: GSM1368280 0.5 19.0 138 1528.6 406.0 4966659
## 9: GSM1368281 0.5 26.0 182 1902.2 517.0 6722419
## 10: GSM1368282 0.5 15.0 106 996.5 301.0 3160098
## 11: GSM1368283 0.5 50.0 357 2451.7 1009.5 5034452
## 12: GSM1368284 0.5 31.0 229 1578.8 636.0 3781322
## 13: GSM1368285 0.5 33.0 238 1784.7 665.0 3855298
## 14: GSM1368286 0.5 23.0 170 1386.8 481.5 3419088
## 15: GSM1368287 0.5 21.0 161 1447.8 452.0 4531663
## 16: GSM1368288 0.5 32.0 224 1853.6 614.5 4562092
## 17: GSM1368289 0.5 35.0 252 2351.9 704.0 7223993
## 18: GSM1368290 0.5 22.0 172 1505.2 473.0 4711077
## 19: GSM1368291 0.5 23.0 154 1325.8 439.0 3317558
## 20: GSM1368292 0.5 39.0 267 1821.3 738.5 4093207
## 21: GSM1368293 0.5 52.0 357 2330.7 1009.5 4818286
## 22: GSM1368294 0.5 19.0 120 1473.7 385.0 3656215
## 23: GSM1368295 0.5 25.5 168 1713.5 521.5 4387929
## 24: GSM1368296 0.5 22.0 147 1557.8 469.0 4337853
## 25: GSM1368297 0.5 37.0 255 2580.2 775.5 6061779
## 26: GSM1368298 0.5 28.0 189 1721.2 580.0 4262642
## 27: GSM1368299 0.5 35.0 239 1781.4 713.0 3280103
## 28: GSM1368300 0.5 41.0 294 2026.0 824.5 3590986
## 29: GSM1368301 0.5 36.0 248 1845.1 674.0 5268796
## 30: GSM1368302 0.5 39.0 302 2072.8 833.5 5229526
## 31: GSM1368303 0.5 36.0 250 1884.7 699.5 3935733
## 32: GSM1368304 0.5 33.0 224 1670.9 612.0 4174104
## 33: GSM1368305 0.5 36.0 260 1733.5 725.0 3387186
## 34: GSM1368306 0.5 29.0 203 1798.8 582.5 4276682
## 35: GSM1368307 0.5 26.0 184 1416.2 513.5 3266544
## 36: GSM1368308 0.5 19.0 153 1385.7 433.0 4232664
## sample min 1st median mean 3rd max
In most cases, raw data does not cluster very well, lets see if that is also true for the fission experiment. Assuming it doesn’t, lets normalize the data using the defaults (cpm, quantile, log2) and try again.
## Something in this is causing a build loop on travis...
## Unsurprisingly, the raw data doesn't cluster well at all...
fis_rawpca <- plot_pca(fission_expt, expt_labels=fission_expt$condition, cis=NULL)
fis_rawpca$plot
## So, normalize the data
norm_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant", convert="cpm"))
## And try the pca again
fis_normpca <- plot_pca(norm_expt, plot_labels="normal", title="normalized pca", cis=NULL)
fis_normpca$plot
summary(fis_normpca)
## Length Class Mode
## pca 3 -none- list
## plot 9 gg list
## table 8 data.frame list
## res 4 data.frame list
## variance 35 -none- numeric
In the final line of the preceeding block, I printed a summary of the return from plot_pca(). It contains the following information:
With that in mind, lets perform some more pca plots after normalizing the data and see how different they look.
normbatch_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant",
convert="cpm", batch="sva"))
fis_normbatchpca <- plot_pca(normbatch_expt,
title="Normalized PCA with batch effect correction.", cis=NULL)
fis_normbatchpca$plot
## ok, that caused the 0, 60, 15, and 30 minute samples to cluster nicely
## the 120 and 180 minute samples are still a bit tight
## pca_information provides some more information about the call to
## fast.svd that went into making the pca plot
fis_info <- pca_information(norm_expt,
expt_factors=c("condition","batch","strain","minute"),
num_components=6)
## More shallow curves in these plots suggest more genes in this principle component.
## The r^2 table shows that quite a lot of the variance in the data is explained by condition
head(fis_info$rsquared_table)
## prop_var cumulative_prop_var condition batch strain minute
## 1 36.642 36.64 99.14 0.017 0.123 98.832
## 2 14.010 50.65 96.67 0.382 0.006 95.833
## 3 8.432 59.08 66.40 25.100 1.834 61.469
## 4 5.807 64.89 38.72 26.589 6.086 29.265
## 5 5.179 70.07 88.88 5.137 0.316 87.439
## 6 3.633 73.70 15.16 0.950 7.738 4.997
## We can look at the correlation between the principle components and the factors in the experiment
## in this case looking at condition/batch vs the first 4 components.
fis_info$pca_cor
## PC1 PC2 PC3 PC4 PC5 PC6
## condition 0.31602 -0.269379 -0.05825 -0.3246 0.03743 0.28533
## batch 0.01233 -0.060604 0.50079 -0.5105 -0.22602 0.09691
## strain 0.03507 -0.007689 0.13542 -0.2467 0.05617 0.27817
## minute 0.57717 -0.530995 -0.35561 -0.2228 -0.02303 0.08811
## And p-values to lend some credence(or not to those assertions)
fis_info$anova_p
## PC1 PC2 PC3 PC4 PC5 PC6
## condition 0.0604265 0.1121139 0.735783 0.053393 0.8284 0.09163
## batch 0.9430865 0.7255040 0.001865 0.001468 0.1850 0.57394
## strain 0.8390728 0.9645021 0.431012 0.146923 0.7449 0.10044
## minute 0.0002285 0.0008625 0.033295 0.191462 0.8940 0.60937
## Try again with batch removed data
batchnorm_expt <- sm(normalize_expt(fission_expt, batch="limma", norm="quant",
transform="log2", convert="cpm"))
fis_batchnormpca <- plot_pca(batchnorm_expt, plot_title="limma corrected pca", cis=NULL)
fis_batchnormpca$plot
test_pca <- pca_information(batchnorm_expt,
expt_factors=c("condition","batch","strain","minute"),
num_components=6)
## More shallow curves in these plots suggest more genes in this principle component.
Interesting, the batch normalized pca plot looks much the same as the normalized. The variances are in fact pretty much the exact same…
We have some tools which provide visualizations of the distribution of the data:
fission_boxplot <- sm(plot_boxplot(fission_expt))
fission_boxplot
sf_expt <- sm(normalize_expt(fission_expt, norm="sf"))
fission_boxplot <- sm(plot_boxplot(sf_expt))
fission_boxplot
tm_expt <- sm(normalize_expt(fission_expt, norm="tmm"))
fission_boxplot <- sm(plot_boxplot(tm_expt))
fission_boxplot
rle_expt <- sm(normalize_expt(fission_expt, norm="rle"))
fission_boxplot <- sm(plot_boxplot(rle_expt))
fission_boxplot
up_expt <- sm(normalize_expt(fission_expt, norm="upperquartile"))
fission_boxplot <- sm(plot_boxplot(up_expt))
fission_boxplot
fission_density <- plot_density(norm_expt)
fission_density$plot
fission_density <- plot_density(sf_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot
fission_density <- plot_density(tm_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot
compare_12 <- plot_single_qq(fission_expt, x=1, y=2)
compare_12$log
Ok, so we can further check out how the data cluster with respect to one another…
fission_cor <- plot_corheat(norm_expt)
fission_cor$plot
fission_cor <- plot_corheat(batchnorm_expt)
fission_cor$plot
fission_dis <- plot_disheat(norm_expt)
fission_dis$plot
fission_dis <- plot_disheat(batchnorm_expt)
fission_dis$plot
A new toy I have been introduced to is variancePartition, let us try that with this data.
test_varpart <- varpart(fission_expt, predictor=NULL, factors=c("condition","batch"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.5 min
## Placing factor: condition at the beginning of the model.
test_varpart$percent_plot
test_varpart$partition_plot
## Here, let us test the variance contributed by strain, time, and replicate.
test_varpart <- varpart(fission_expt, predictor=NULL, factors=c("condition", "strain", "minute", "replicate"))
## Attempting mixed linear model with: ~ (1|condition) + (1|strain) + (1|minute) + (1|replicate)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.7 min
## Placing factor: condition at the beginning of the model.
test_varpart$percent_plot
test_varpart$partition_plot
YAY!
pander::pander(sessionInfo())
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: variancePartition(v.1.10.0), ggplot2(v.2.2.1), fission(v.0.114.0), SummarizedExperiment(v.1.10.1), DelayedArray(v.0.6.0), BiocParallel(v.1.14.1), matrixStats(v.0.53.1), Biobase(v.2.40.0), GenomicRanges(v.1.32.3), GenomeInfoDb(v.1.16.0), IRanges(v.2.14.10), S4Vectors(v.0.18.2), BiocGenerics(v.0.26.0), foreach(v.1.4.4), Vennerable(v.3.1.0.9000), ruv(v.0.9.7) and hpgltools(v.2018.03)
loaded via a namespace (and not attached): minqa(v.1.2.4), Rtsne(v.0.13), colorspace(v.1.3-2), colorRamps(v.2.3), rprojroot(v.1.3-2), htmlTable(v.1.12), corpcor(v.1.6.9), XVector(v.0.20.0), base64enc(v.0.1-3), rstudioapi(v.0.7), ggrepel(v.0.8.0), bit64(v.0.9-7), AnnotationDbi(v.1.42.1), codetools(v.0.2-15), splines(v.3.5.1), doParallel(v.1.0.11), DESeq(v.1.32.0), robustbase(v.0.93-0), geneplotter(v.1.58.0), knitr(v.1.20), ade4(v.1.7-11), Formula(v.1.2-3), nloptr(v.1.0.4), pbkrtest(v.0.4-7), packrat(v.0.4.9-3), annotate(v.1.58.0), cluster(v.2.0.7-1), graph(v.1.58.0), compiler(v.3.5.1), backports(v.1.1.2), Matrix(v.1.2-14), lazyeval(v.0.2.1), limma(v.3.36.1), acepack(v.1.4.1), htmltools(v.0.3.6), tools(v.3.5.1), gtable(v.0.2.0), GenomeInfoDbData(v.1.1.0), reshape2(v.1.4.3), Rcpp(v.0.12.17), gdata(v.2.18.0), preprocessCore(v.1.42.0), nlme(v.3.1-137), iterators(v.1.0.9), stringr(v.1.3.1), lme4(v.1.1-17), openxlsx(v.4.1.0), gtools(v.3.5.0), devtools(v.1.13.5), XML(v.3.98-1.11), edgeR(v.3.22.2), DEoptimR(v.1.0-8), directlabels(v.2018.05.22), zlibbioc(v.1.26.0), MASS(v.7.3-50), scales(v.0.5.0), BiocInstaller(v.1.30.0), doSNOW(v.1.0.16), RBGL(v.1.56.0), RColorBrewer(v.1.1-2), yaml(v.2.1.19), memoise(v.1.1.0), gridExtra(v.2.3), pander(v.0.6.1), rpart(v.4.1-13), latticeExtra(v.0.6-28), stringi(v.1.2.2), RSQLite(v.2.1.1), highr(v.0.6), genefilter(v.1.62.0), checkmate(v.1.8.5), caTools(v.1.17.1), zip(v.1.0.0), rlang(v.0.2.1), bitops(v.1.0-6), evaluate(v.0.10.1), lattice(v.0.20-35), htmlwidgets(v.1.2), labeling(v.0.3), bit(v.1.1-14), plyr(v.1.8.4), magrittr(v.1.5), DESeq2(v.1.20.0), gplots(v.3.0.1), snow(v.0.4-2), Hmisc(v.4.1-1), DBI(v.1.0.0), pillar(v.1.2.3), foreign(v.0.8-70), withr(v.2.1.2), mgcv(v.1.8-24), survival(v.2.42-4), RCurl(v.1.95-4.10), nnet(v.7.3-12), tibble(v.1.4.2), KernSmooth(v.2.23-15), rmarkdown(v.1.9), locfit(v.1.5-9.1), grid(v.3.5.1), sva(v.3.28.0), data.table(v.1.11.4), blob(v.1.1.1), digest(v.0.6.15), xtable(v.1.8-2), genoPlotR(v.0.8.7), munsell(v.0.4.3) and quadprog(v.1.5-5)